Abstract

The cooperative behaviour of interacting neurons and synapses is 
studied using models and methods from statistical physics. The com- 
petition between training error and entropy may lead to discontinu- 
ous properties of the neural network. This is demonstrated for a few 
examples: Perceptron, associative memory, learning from examples, 
generalization, multilayer networks, structure recognition, Bayesian 
estimate, on-line training, noise estimation and time series genera- 
tion. 

1 Introduction 

For about 15 years there has been a wave of interdisciplinary research activities
under the topic "neural networks". Neurobiologists, computer scientists,
mathematicians, physicists, psychologists, and linguists are making a more
or less common effort to understand the cooperative properties of a system of
interacting neurons [Hertz et al 1991]. Meanwhile, the initial excitement and
exaggerated promises have been replaced by practical research programs, but
much has been achieved and many interesting and unexpected results have
been obtained.

The research on neural networks may be classified into three objectives:

1. Neurobiology: The material basis of our brain is about $10^{11}$ neurons,
each of which is directly connected to about $10^{4}$ other ones via synapses.
We know a lot about these single units and their anatomical and func- 
tional organization. However, we are still far away from understanding 
learning, association, memory, recognition and generalization on the 
basis of interacting neurons and their synaptic plasticity. It may be a 
philosophical problem whether mind, soul, creativity and consciousness 
can be understood by collective properties of a system of nerve cells. 
But there is a good chance to elucidate the basic properties of a real 
neural network by simple models. 

2. Computer science: There exists a variety of algorithms which use con- 
cepts from real neural networks. Simple units represent information 
and interact by synaptic weights. Such systems are trained by a set of 
examples. After the training phase, in which the synaptic weights are 
adapted to the presented examples, the network is able to achieve a 
knowledge about the rule which produced the examples; it can generalize.
These algorithms are called neural networks or neurocomputers;
they are presently applied to a large variety of problems in engineer- 
ing, science and economy. They have several advantages compared to 
standard approaches, and there is hope to solve problems by neural 
networks which are too hard for methods of rule and data based algo- 
rithms of artificial intelligence. 

3. Physics: Neural networks definitely belong to the class of complex systems,
which are characterized by nonlinear dynamics, feedback and
macroscopic properties emerging from a huge number of interacting 
units. In general, physics is interested in understanding such systems 
in terms of mathematical relations, scaling laws, phase transitions etc. 

In physics mathematical modelling of nature has been very successful. However,
it is not clear at all whether such a complex system like the brain
can be described by a mathematical language, by simple relations between 
macroscopic functions and microscopic mechanisms. 

On the other hand, the full quantum mechanical description of an iron solid
is not possible, either. Nevertheless one gains a lot of insight into the spon- 
taneous magnetic ordering below a critical temperature if one studies the 
Ising model, which replaces the rather complex iron atom by a simple bi- 
nary unit interacting with its neighbors. With this analogy it is definitely 
useful to investigate simple units, which model a few essential mechanisms 
of neurons and synapses, and to study the cooperative behaviour of such 
interacting units. It is not obvious at all, whether such a system can store 
an infinite number of patterns with one set of synapses, learn from examples 
and generalize. Many questions can only be answered from a mathematical 
calculation. 

In this talk I want to emphasize the contribution of statistical physics to the 
theory of neural computation. Using models and methods from the physics 
of condensed matter one has been able to calculate the properties of neu- 
ral networks. This research program uses methods developed already at the 
beginning of this century by L. Boltzmann and J. W. Gibbs. In 1975 S. F. Ed- 
wards and P. W. Anderson, S. Kirkpatrick and D. Sherrington developed a 
theory of spin glasses [Fischer and Hertz 1991] which was extended to neural 
networks by J. J. Hopfield [1982]. The first analytic solution of attractor 
networks succeeded in 1985 [Amit et al 1987]. The statistical mechanics of 
learning from examples was pioneered by the late E. Gardner [1988]. These 
approaches opened a new field of research, which produced a lot of interest- 
ing results [for reviews see Watkin et al 1993, Opper and Kinzel 1996]. In 
view of the big challenge to understand the brain, the statistical physics of 
neural networks will definitely survive over the next century. 

Long before the approach of statistical mechanics, mathematical models of
neural networks had been investigated in detail with great success [Hertz
et al 1991]. But in my talk I want to demonstrate on a few examples, to
what extent the physics approach is able to ask questions and obtain results
which are different from the approach of mathematics and computer science. 
In particular I think that only the methods of physics can calculate discon- 
tinuous and singular properties of infinitely large networks. Hence, in this 
paper I discuss examples from attractor networks, generalization, structure 
recognition, Bayesian estimate, on-line training, noise estimation and time 
series generation, which show discontinuous behavior as a function of model 
parameters or the size of the training set. 

This talk is not supposed to be a review. I apologize to all of my colleagues 
whose important contributions to the theory of neural networks are not men- 
tioned. 



2 Perceptron 

The simplest model of a neural network was introduced by Rosenblatt as
early as 1960. It consists of an input layer of "neurons" $S_i$, $i = 1, ..., N$,
which take only binary values $S_i \in \{-1, +1\}$. The activity $S_i$ of each neuron
is transmitted by "synapses" $W_i \in \mathbb{R}$ to an output neuron $\sigma$ as shown in
Fig. 1. The output reacts to the sign of the "postsynaptic potential",

$$\sigma = \mathrm{sign} \sum_{i=1}^{N} W_i S_i = \mathrm{sign}(W \cdot S) \qquad (1)$$

In the training phase this network, which was called "perceptron", receives
a set of training examples $(\sigma^\nu, S^\nu)$, $\nu = 1, ..., \alpha N$. It changes its weights $W_i$
such that a maximum number of examples is correctly mapped by Eq. (1).
A simple algorithm has been investigated by Rosenblatt (see Hertz et al
1991): It presents the examples in an arbitrary sequence. If an example is
not correctly classified, that is if $W(t) \cdot S^\nu \sigma^\nu < 0$, then

$$W(t+1) = W(t) + \frac{1}{N} S^\nu \sigma^\nu \qquad (2)$$

Figure 1: Architecture of the perceptron (left) and the committee machine (right).

There exists a convergence proof for this algorithm: If the examples can be
mapped correctly by any perceptron, Eq. (1) with weights $W^*$, then the
perceptron rule Eq. (2) finds a solution, i.e. the algorithm stops.
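
The training rule and its stopping criterion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sizes `N = 50` and `P = 40`, the random seed, the zero start vector and the function name `train_perceptron` are choices made here, and the test for a misclassified example uses `<= 0` instead of `< 0` so that the zero start vector also triggers updates.

```python
import random

def train_perceptron(examples, n, max_sweeps=10_000):
    """Rosenblatt rule: add sigma * S / N for every misclassified example."""
    w = [0.0] * n                      # assumption: start from W = 0
    for _ in range(max_sweeps):
        errors = 0
        for s, sigma in examples:
            field = sum(wi * si for wi, si in zip(w, s))
            # "<= 0" (rather than "< 0") is an assumption made here so that
            # the all-zero start vector is treated as a mistake as well.
            if field * sigma <= 0:
                w = [wi + sigma * si / n for wi, si in zip(w, s)]
                errors += 1
        if errors == 0:                # a full sweep without mistakes: stop
            return w
    raise RuntimeError("no separating weight vector found within max_sweeps")

rng = random.Random(0)
N, P = 50, 40                          # alpha = P / N = 0.8 < 1
examples = [([rng.choice((-1, 1)) for _ in range(N)], rng.choice((-1, 1)))
            for _ in range(P)]
w = train_perceptron(examples, N)
n_correct = sum(1 for s, sigma in examples
                if sigma * sum(wi * si for wi, si in zip(w, s)) > 0)
print(n_correct, "of", P, "random examples are classified correctly")
```

The labels are random here, so nothing is "learned" in the sense of a rule; the sketch only shows that the algorithm stops once a separating weight vector exists.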

The Rosenblatt training rule stems from neurobiology, as proposed by D. Hebb 
in 1949: Each synapse reacts to the neuronal activities at its two ends. Here 
we need an additional influence of the postsynaptic potential. 

The perceptron implements a linearly separable Boolean function, which has
an interesting geometrical interpretation: $W \cdot S = 0$ defines a hyperplane in
the $N$-dimensional space of inputs $S$; the weight vector $W$ is normal to this
plane. On the side of the vector $W$ the perceptron classifies each input $S$
as $\sigma = +1$ (black, correct, ...), on the other side the label is $\sigma = -1$ (white,
wrong, ...), see Fig. 2.



Now we consider a set of $\alpha N$ points $S^\nu$ in $N$ dimensions. How many
sets of labels $\{\sigma^\nu\}$ can be represented by any perceptron? Surprisingly this
problem, which is important for the theory of neural computation, was already
solved by the Swiss mathematician Ludwig Schläfli in the last century
[Schläfli 1852]. If every subset of $N$ points is linearly independent, then the
number $C$ of possible sets of labels $\{\sigma^\nu\}$ is given by

$$C = 2 \sum_{i=0}^{N-1} \binom{\alpha N - 1}{i} \qquad (3)$$

Figure 2: Space of input vectors $S$. The weight vector $W$ of the perceptron
defines a hyperplane in the $N$-dimensional space, which separates the labels $\sigma$ of
the input vectors $S$. In the shaded region the labels of teacher $B$ and student
perceptron $W$ are different.

Figure 3: Two-dimensional projections of 250 points in 200 dimensions. The
points are labelled randomly. The perceptron algorithm finds a hyperplane which
separates different labels.

For $\alpha < 1$ all labels can be produced by a perceptron, i.e. $C = 2^{\alpha N}$.
For $\alpha < 2$ there is a large fraction of labels which can be separated by
a hyperplane. For $\alpha > 2$ only a tiny fraction of patterns can be stored;
this fraction disappears for $N \to \infty$. This result has consequences for the
associative memory which will be discussed in the following section: In the
limit of $N \to \infty$ a network with $N$ neurons can store up to $2N$ random
patterns.
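
The counting formula can be evaluated exactly with integer arithmetic. The following sketch assumes the standard form $C = 2 \sum_{i=0}^{N-1} \binom{P-1}{i}$ for $P = \alpha N$ points in general position; the function names are chosen here for illustration only.

```python
from fractions import Fraction
from math import comb

def count_dichotomies(p, n):
    """Number of label sets a perceptron can realize for p points
    in general position in n dimensions."""
    return 2 * sum(comb(p - 1, i) for i in range(n))

def separable_fraction(p, n):
    """Exact fraction C / 2^p of all label sets that are linearly separable."""
    return Fraction(count_dichotomies(p, n), 2 ** p)

N = 200
print(separable_fraction(150, N))      # alpha < 1: every labelling is realizable
print(separable_fraction(2 * N, N))    # alpha = 2: exactly one half
# alpha = 1.25, the case of 250 points in 200 dimensions:
print(float(1 - separable_fraction(250, N)))
```

The last line prints the tiny fraction of label sets that cannot be separated for 250 points in 200 dimensions.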

The geometry of this result is shown in Fig. 3. 250 points are located in
a 200 dimensional space and randomly colored black or white with equal
probability. Now we are moving in the space of points and would like to
find a view where black and white are clearly separated by a single gap. From
Eq. (3) we find with $N = 200$ and $\alpha N = 250$: $C/2^{250} \simeq 1 - 4 \cdot 10^{-23}$. This
means that for random labels it is almost sure that one can find such a view. In
fact the Rosenblatt algorithm, Eq. (2), found the solution shown in Fig. 3.

Figure 4: A set of six perceptrons is connected to form an attractor network.

3 Attractor networks 

The perceptron is the "atom" of all neural networks. Many such elementary
units can be composed into a large and complex network. Here we consider
an attractor network which consists of $N$ neurons $S_i$ as before. But now each
element $S_i$ is connected to every other element $S_j$ by a coupling $W_{ij} \in \mathbb{R}$, as
illustrated in Fig. 4.

We want to store $\alpha N$ patterns $S_i^\nu \in \{-1, +1\}$; $i = 1, ..., N$; $\nu = 1, ..., \alpha N$.
If we use the Rosenblatt rule, Eq. (2), without the additional
condition, we obtain the Hebbian couplings

$$W_{ij} = \frac{1}{N} \sum_{\nu=1}^{\alpha N} S_i^\nu S_j^\nu \qquad (4)$$

Since each input $S_j$ is the output of a perceptron with weights $W_{jk}$, we can define
a dynamics of the configuration of neurons, $S$. For instance, for each neuron
$S_i$ we consider the local field

$$h_i = \sum_j W_{ij} S_j(t) \qquad (5)$$

where $t$ is a discrete time index. Now we define a stochastic dynamics by the
probability $P$ to find neuron $S_i$ in the state $S \in \{+1, -1\}$ in the next time
step $t + 1$:

$$P(S_i(t+1) = S) = \frac{\exp(\beta S h_i)}{2 \cosh(\beta h_i)} \qquad (6)$$

$\beta$ is a parameter which controls the noise level of the dynamics. For $\beta \to \infty$
we obtain the noiseless deterministic equation

$$S_i(t+1) = \mathrm{sign} \sum_j W_{ij} S_j(t) \qquad (7)$$
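
A minimal sketch of such an attractor network, with Hebbian couplings and the noiseless dynamics, is given below. Several details are assumptions made for the illustration and are not taken from the text: the sizes `N = 100` and `P = 5` (load $\alpha = 0.05$), the removal of the self-couplings $W_{ii}$, the sequential order of the updates, the 10% flip probability of the initial state and all function names.

```python
import random

def hebbian_couplings(patterns):
    """W_ij = (1/N) * sum over patterns of S_i S_j."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for s in patterns:
        for i in range(n):
            row, si = w[i], s[i]
            for j in range(n):
                row[j] += si * s[j] / n
    for i in range(n):
        w[i][i] = 0.0                  # assumption: no self-coupling
    return w

def relax(w, state, max_sweeps=50):
    """Noiseless dynamics, neurons updated one after the other,
    repeated until no neuron changes any more."""
    s = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):
            h = sum(wij * sj for wij, sj in zip(w[i], s))   # local field
            new = 1 if h >= 0 else -1
            if new != s[i]:
                s[i], changed = new, True
        if not changed:
            break
    return s

def overlap(a, b):
    return sum(x * y for x, y in zip(a, b)) / len(a)

rng = random.Random(1)
N, P = 100, 5
patterns = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(P)]
w = hebbian_couplings(patterns)
noisy = [-x if rng.random() < 0.1 else x for x in patterns[0]]
final = relax(w, noisy)
print("overlap before:", overlap(noisy, patterns[0]),
      "after:", overlap(final, patterns[0]))
```

At this low load the noisy initial state should relax towards the stored pattern; repeating the experiment with many more patterns per neuron illustrates the loss of retrieval.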

This model was introduced by Hopfield [1982]. He noticed that the dynamics 
of the neurons is nothing else than the usual Monte Carlo procedure to obtain 
thermal equilibrium. Since the couplings are symmetric, Wij = Wji, the 
stationary state is given by a Boltzmann distribution 

$$P(S) = \exp(-\beta H(S)) / Z \qquad (8)$$

with a Hamiltonian

$$H(S) = -\sum_{i<j} W_{ij} S_i S_j \qquad (9)$$

This is the main advantage of equilibrium statistical mechanics: The dynamics
$S(t)$ is replaced by a summation over all possible states $S$. Instead
of solving a system of $N$ strongly coupled nonlinear equations, one has to
calculate the partition sum

$$Z = \sum_{\{S\}} \exp(-\beta H(S)) \qquad (10)$$



Figure 5: An initial state of an attractor network relaxes to one of the stored
patterns. (From Kinzel 1985)

In the thermodynamic limit of infinitely many neurons, $N \to \infty$, and infinitely
many patterns, $\alpha =$ const., the partition sum $Z$ was evaluated exactly
by Amit et al [1987] using the replica method. There are two main steps in
the calculation:

1. The free energy $f = -\ln Z / \beta$ is averaged over all possible sets of patterns
$\{S^\nu\}$. It can be shown that the average value gives the same
results as the value $f$ for a single, randomly chosen set of patterns.
Hence, this calculation yields results for a typical situation.

2. The sum over $2^N$ states in $\ln Z$ is performed for fixed order parameters.
The minimum of $f$ as a function of these quantities yields their physical
values, which describe the stationary state. Hence, the complex system
of interacting neurons is described exactly by a few order parameters.

The first step is done using the replica method:

$$\langle \ln Z \rangle_{\{S^\nu\}} = \lim_{n \to 0} \frac{\partial}{\partial n} \langle Z^n \rangle_{\{S^\nu\}} \qquad (11)$$

In the second step one is interested in the overlap between the state $S$ and
one of the patterns $S^\nu$. Let us assume that the first pattern has the form
of an "A" as shown in Fig. 5 with $N = 400$. The other 31 patterns consist of
random bits. If the initial state $S(0)$ has an overlap with the first pattern, for
instance if it is the noisy "A" of Fig. 5, then after a few steps given by Eq. (7)
the network relaxes to the stored information more or less completely.

The statistical mechanics gives information about the possible final states of
the dynamics. Here we are interested in the overlap after the relaxation:

$$m_A = \frac{1}{N} S^1 \cdot S \qquad (12)$$

It turns out that one obtains this order parameter if one calculates the free
energy. The overlaps

$$m_\nu = \frac{1}{N} S^\nu \cdot S$$

to the other 31 patterns are of the order of $1/\sqrt{N}$. However, the sum $r$ of their
squares is an additional order parameter

$$r = \frac{1}{\alpha} \sum_{\nu=2}^{\alpha N} m_\nu^2$$

r measures the fluctuation of the final state to the rest of the patterns. 
Finally there is an order parameter q which measures the complexity of the 
space of possible solutions S. Like in the theory of spin glasses, it signals an 
additional order of the stationary states which has no simple interpretation 
[Fischer and Hertz 1991]. 

The theory of attractor networks has close similarities to the theory of an 
Ising ferromagnet with infinite range interactions. In both cases energy and 
entropy can be expressed in terms of order parameters. For the ferromag- 
net one obtains an implicit equation for the spontaneous magnetic order m 
[Yeomans 1992] 

$$m = \tanh(\beta J m) \qquad (13)$$

For the attractor network one finds

$$m_A = \langle \tanh(\beta m_A + \beta \sqrt{\alpha r}\, z) \rangle_z \qquad (14)$$

where the average is performed over a Gaussian distributed quantity $z$. There
are additional equations for $r$ and $q$. Hence, compared to the ferromagnet,
the $\alpha N - 1$ patterns add a noise term to the local fields.
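
The implicit equation for the ferromagnet can be solved by simple iteration, which already shows how an order parameter appears at a critical noise level. The sketch below treats only $m = \tanh(\beta J m)$; the start value, the tolerance and the function name are choices made here, and the additional equations for $r$ and $q$ of the attractor network are not included.

```python
import math

def magnetization(beta_j, m0=1.0, tol=1e-12, max_iter=100_000):
    """Iterate m -> tanh(beta*J*m) from m0 until the value stops changing."""
    m = m0
    for _ in range(max_iter):
        m_new = math.tanh(beta_j * m)
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    return m

for beta_j in (0.5, 0.9, 1.1, 2.0):
    print("beta*J =", beta_j, " m =", round(magnetization(beta_j), 4))
```

For $\beta J < 1$ the iteration runs into $m = 0$, for $\beta J > 1$ it stops at a nonzero spontaneous order.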

Fig. 6 shows the result of the analytic calculation [Amit et al 1987]. In
the noise-load plane one obtains several phases, which are well separated
in the thermodynamic limit. For strong noise, $T = 1/\beta > 1$, or for high
load, $\alpha > 0.14$, the network cannot recognize the stored patterns at all.
Nevertheless there is a spin glass order for low noise $T < T_g$ with $q > 0$.
Only for $T < T_M(\alpha)$ the network can relax to final states which have an
overlap with one of the stored patterns. This overlap jumps discontinuously to
zero at $T_M$. For $T < T_c(\alpha)$ this retrieval state has the lowest free energy, i.e. it
is thermodynamically stable. Note that also for the deterministic dynamics
($T = 0$) there is a discontinuous drop to zero retrieval at $\alpha = \alpha_c \simeq 0.14$.
For $\alpha < \alpha_c$ the network restores stored information very well, for $\alpha > \alpha_c$
the network relaxes to final states which have nothing in common with the
stored patterns.

Figure 6: Schematic phase diagram of the Hopfield model. (From Amit et al,
1987)

According to Schläfli there are couplings with a storage capacity of $\alpha_c = 2$,
but these interactions are not symmetric and one cannot apply statistical
mechanics to solve the corresponding attractor network. A network with
$\alpha_c = 1$ has been analysed by Kanter and Sompolinsky [1987].

In summary, the attractor network functions as an associative memory. It 
is a distributed memory, since all patterns are stored in all couplings. It 
is content addressable, since a state with partial information relaxes to
the complete one. Even with a stochastic dynamics it performs well, if the
noise level $T$ and the storage load $\alpha$ are not too high. There is a sharp,
discontinuous transition between good and zero performance. Processing of 
information emerges as a cooperative effect from a large number of simple 
interacting units.

4 Generalization



We have seen how a network of mutually interacting perceptrons can work 
as an associative memory. But already the simple perceptron itself has in- 
teresting properties. It can learn from examples and recognize an unknown 
rule. 

Consider a perceptron with a weight vector $W$, as in Eq. (1). We will call this
perceptron the "student". It obtains a set of examples $(\sigma^\nu, S^\nu)$, $\nu = 1, ..., \alpha N$
from a "teacher". In the simplest case the teacher is another perceptron
with a weight vector $B$. To what extent can the student gain information
about the vector $B$ if the only available information is the set of $\alpha N$
examples? The patterns $S^\nu$ are selected randomly and $\sigma^\nu$ is the output of
the teacher,

$$\sigma^\nu = \mathrm{sign}(B \cdot S^\nu) \qquad (15)$$

As before we are interested in the limit $N \to \infty$ and $\alpha =$ constant.
We have to consider two processes: 

1. The training phase:

The student network is trained by use of the examples; it tries to decrease
the training error

$$\varepsilon_t(W) = \sum_{\nu=1}^{\alpha N} \Theta[-\sigma^\nu W \cdot S^\nu] \qquad (16)$$

$\Theta(x)$ is the step function; the summand is zero if the student reproduces
the example $S^\nu$ correctly.

2. The test phase:

Now the student receives an input $S$ which has not been presented
before. It gives the answer $\sigma = \mathrm{sign}(W \cdot S)$, which may be different
from the answer of the teacher, $\sigma_B = \mathrm{sign}(B \cdot S)$. The probability of
disagreement or the generalization error is defined by an average over
all possible input vectors $S$:

$$\varepsilon_g = \langle \Theta[-\sigma \sigma_B] \rangle_S \qquad (17)$$

From Fig. 2 one can see that $\varepsilon_g$ is determined by the angle between the
weight vectors of the student and the teacher perceptron

$$\varepsilon_g = \frac{1}{\pi} \arccos \frac{B \cdot W}{|B| |W|} \qquad (18)$$
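
The relation between the angle and the probability of disagreement can be checked numerically. The sketch below compares the arccos formula with a direct count of disagreements on random inputs. It assumes isotropic Gaussian inputs, for which the relation holds for any $N$ (for binary inputs it is expected only for large $N$); the dimension, the sample size and the way the student vector is generated are arbitrary choices made here.

```python
import math
import random

def generalization_error(b, w):
    """eps_g = arccos(B.W / (|B||W|)) / pi."""
    dot = sum(x * y for x, y in zip(b, w))
    norm = math.sqrt(sum(x * x for x in b) * sum(y * y for y in w))
    cos = max(-1.0, min(1.0, dot / norm))   # guard against rounding
    return math.acos(cos) / math.pi

def disagreement_rate(b, w, n_inputs, rng):
    """Fraction of random inputs on which student and teacher disagree."""
    n = len(b)
    wrong = 0
    for _ in range(n_inputs):
        s = [rng.gauss(0.0, 1.0) for _ in range(n)]   # assumption: Gaussian inputs
        h_teacher = sum(x * y for x, y in zip(b, s))
        h_student = sum(x * y for x, y in zip(w, s))
        if h_teacher * h_student < 0:
            wrong += 1
    return wrong / n_inputs

rng = random.Random(2)
N = 20
B = [rng.gauss(0.0, 1.0) for _ in range(N)]           # teacher
W = [x + rng.gauss(0.0, 1.0) for x in B]              # student: teacher plus noise
eps_theory = generalization_error(B, W)
eps_sampled = disagreement_rate(B, W, 20_000, rng)
print("formula:", round(eps_theory, 3), " sampled:", round(eps_sampled, 3))
```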



Training and generalization of the perceptron have been studied in detail using
methods of statistical mechanics [see e.g. Watkin et al 1993, Opper and
Kinzel 1996] in a simple scenario where the weights are restricted to binary
values, $W_i \in \{+1, -1\}$ and $B_i \in \{+1, -1\}$. The student perceptron is trained
by a stochastic algorithm, for instance by a Monte Carlo procedure similar
to Eq. (6). But now we have a stochastic dynamics of the synaptic weights
$W_i$ instead of the neurons $S_i$, which leads to a thermal equilibrium given by

$$P(W) = \exp[-\beta \varepsilon_t(W)] / Z \qquad (19)$$

As before we do not have to solve the complex nonlinear dynamics of the
weights $W(t)$ but rather calculate the partition sum

$$Z = \sum_{\{W\}} \exp[-\beta \varepsilon_t(W)] \qquad (20)$$

Again we have to perform two steps:

1. Average $\ln Z$ over all possible sets of examples $\{S^\nu\}$.

2. Evaluate the sum over $2^N$ states $W$ by introducing order parameters.

In the limit of large noise, $\beta \to 0$, the calculation turns out to be very
easy [Seung et al 1992]: The only order parameter is the overlap $R$ between
student and teacher,

$$R = \frac{1}{N} B \cdot W \qquad (21)$$

The training error of Eq. (16) can be replaced by the generalization error

$$\alpha \varepsilon_g = \frac{\alpha}{\pi} \arccos R \qquad (22)$$

and the entropy is the well known mixing entropy of binary variables

$$S(R) = -\frac{1}{2} [(1 + R) \ln(1 + R) + (1 - R) \ln(1 - R)] + \ln 2 \qquad (23)$$

$R$ is determined by the minimum of the free energy

$$f(R) = \alpha \varepsilon_g(R) - T S(R) \qquad (24)$$

Note that the product $\beta\alpha$ in the limit $\beta \to 0$ is the only parameter of the
model; hence a large noise has to be compensated by a large number of
examples.

Figure 7: Generalization error $\varepsilon_g$ as a function of the size $\alpha$ of the training
set (schematic; the horizontal axis shows $\beta\alpha$ from 0 to 3). The binary perceptron
is trained stochastically for large noise ($\beta \to 0$). The dotted line describes the
first order phase transition to perfect generalization in thermal equilibrium. The
solid line extends to a metastable state. (From Seung et al 1992)

One minimum of $f$ is always $R = 1$, i.e., the student perfectly recognizes
the teacher. However, for $\beta\alpha < 2.08$, the system has an additional minimum
at $R < 1$ which is the global one for $\beta\alpha < 1.69$. Fig. 7 shows the generalization
error as a function of the fraction of learned patterns. There is a
discontinuous transition from poor to perfect generalization, similar to a first
order phase transition in solid state physics. Both of the transitions are characterized
by metastable states and hysteresis loops. This process of sudden
recognition appears even for a noisy training algorithm. A replica calculation
shows that the transition qualitatively extends to zero noise $T = 0$ [Seung
et al, 1992].
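
The two thresholds can be located numerically from the free energy. The sketch below works with $\beta f(R) = \beta\alpha\, \arccos(R)/\pi - S(R)$, i.e. the free energy multiplied by $\beta$ so that only the product $a = \beta\alpha$ enters. The grid search, the bisection and all names are choices made for this illustration and not the method of the original calculation.

```python
import math

def free_energy(r, a):
    """beta * f(R) = a * eps_g(R) - S(R), with a = beta * alpha."""
    eps_g = math.acos(r) / math.pi
    entropy = math.log(2.0)
    for x in (1.0 + r, 1.0 - r):
        if x > 0.0:
            entropy -= 0.5 * x * math.log(x)
    return a * eps_g - entropy

def interior_minimum(a, grid=4000):
    """Return (R, beta*f) at the local minimum with R < 1, or None."""
    rs = [i / grid for i in range(grid)]          # 0 <= R < 1
    fs = [free_energy(r, a) for r in rs]
    for i in range(1, grid - 1):
        if fs[i] < fs[i - 1] and fs[i] <= fs[i + 1]:
            return rs[i], fs[i]
    return None

def bisect(predicate, lo, hi, steps=30):
    """Boundary between predicate true (at lo) and false (at hi)."""
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if predicate(mid):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# The state with R < 1 is the global minimum as long as its free energy
# lies below the value beta*f = 0 of the state R = 1.
a_transition = bisect(lambda a: interior_minimum(a)[1] < 0.0, 1.0, 2.0)
# The minimum with R < 1 exists, possibly only as a metastable state,
# up to the point where it disappears.
a_spinodal = bisect(lambda a: interior_minimum(a) is not None, 1.0, 3.0)
print("first order transition near beta*alpha =", round(a_transition, 3))
print("metastable state disappears near beta*alpha =", round(a_spinodal, 3))
```

Both values should come out close to the numbers quoted above.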
5 Multilayer networks



We have already seen how an attractor network can be built from many per- 
ceptrons. Another interesting system which can be constructed from many 
elementary units is a multilayer network, shown in Fig. 1. It consists of 
several layers of synaptic weights which map the information coded in the 
neurons from top to bottom. It is important that such networks can realize 
any function, if the number of hidden units (neurons in the interior layers) 
is large enough. 

The simplest multilayer network is a committee machine. It consists of $N$
input units, $K$ hidden units, $K$ weight vectors $W_i$, $i = 1, ..., K$ and one
output unit $\sigma$. The weights of the second layer have the value $+1$, which
means that the output bit $\sigma$ is given by the majority of the $K$ perceptrons
(= members of the committee) in the first layer,

$$\sigma = \mathrm{sign} \sum_{i=1}^{K} \mathrm{sign}(W_i \cdot S) \qquad (25)$$

This network is trained from a set of examples $(\sigma^\nu, S^\nu)$, $\nu = 1, ..., \alpha N$. Note
that the opinion of the majority is trained, not the opinion of each member
of the committee!
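
The majority vote can be written down directly. In the following sketch the function names and the small example with $K = 3$ members and $N = 5$ binary inputs are invented for illustration; both numbers are odd so that no ties occur.

```python
def sign(x):
    return 1 if x >= 0 else -1

def committee_output(weights, s):
    """Majority vote of K perceptrons which all see the same input s.
    Returns the output bit and the individual votes."""
    votes = [sign(sum(w_i * s_i for w_i, s_i in zip(w, s))) for w in weights]
    return sign(sum(votes)), votes

weights = [[1, 1, -1, 1, 1],
           [1, -1, -1, 1, -1],
           [-1, -1, 1, 1, -1]]
s = [1, 1, 1, -1, 1]
sigma, votes = committee_output(weights, s)
print("votes:", votes, "majority:", sigma)   # votes: [1, -1, -1] majority: -1
```

Only `sigma` enters the training error; the individual `votes` are hidden from the training algorithm.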

Here we consider the case where the student is a committee machine with
$K = 3$ members. The teacher is a simple perceptron with a single weight vector
$B$. All the weights are assumed to be binary, $W_{ik}, B_i \in \{+1, -1\}$. To what
extent can a complex network gain information about a simple rule from a
set of examples?

In analogy to the previous section we consider a stochastic training algorithm.
The training error is the "energy" of the Gibbsian probability
which describes the stationary state of the stochastic algorithm. In the limit
of high noise there are several order parameters, which determine the energy,
entropy and generalization error. Firstly, there are the overlaps between the
members of the committee and the teacher,

$$R_i = \frac{1}{N} W_i \cdot B \quad (i = 1, 2, 3) \qquad (26)$$

Secondly, the weight vectors of the committee have a mutual overlap

$$Q_{ij} = \frac{1}{N} W_i \cdot W_j \qquad (27)$$

Their physical values are determined by minimizing the corresponding free
energy.

Figure 8: Generalization error $\varepsilon_g$ as a function of the size of the training set
(schematic). A committee machine with binary weights and three hidden units is
trained on a set of examples given by a binary perceptron. (From Schwarze et al
1992)

Fig. 8 shows the result of the analytic calculation [Schwarze et al 1992]. For a
small size of the training set ($\beta\alpha$ small) the members of the committee react
symmetrically, $R_1 = R_2 = R_3 < 1$, and the generalization error decreases
continuously with $\alpha$. By increasing the size of the training set, suddenly
one of the members recognizes the teacher perfectly, $R_1 = 1$, $R_2 = R_3 < 1$,
and the error jumps to a low value. Further increase of $\beta\alpha$ leads to another
discontinuous transition to perfect recognition by the majority. Since the
majority vote is already determined by two members, the highest entropy is
achieved for $R_1 = R_2 = 1$ and $R_3 = 0$.

Here again the competition between energy and entropy leads to an interesting
discontinuous behavior of the generalization ability. Such sharp transitions,
which occur only for infinitely large networks ($N \to \infty$), are not
obvious. One needs the tools of statistical mechanics to find and describe
them.

Figure 9: Parity machine with a tree architecture. Each of the three weight
vectors $W_i$ is connected to only one third of the input vector. The output bit is
the product of the three hidden units.



6 Parity machine 

Now we want to discuss another multilayer network, the parity machine with
a tree architecture shown in Fig. 9. It consists of $N$ input and $K$ hidden
units. The input units are grouped into $K$ parts with $N/K$ neurons each.
Each part is the input of a perceptron with weights $W_i$ ($i = 1, ..., K$). The
output of the whole network is given by the parity of the outputs of the $K$
perceptrons,

$$\sigma = \prod_{j=1}^{K} \mathrm{sign}(W_j \cdot S) \qquad (28)$$

We consider the case where both the student and the teacher network are
parity machines with the same number of units. The teacher network
presents a set of examples given by

$$\sigma^\nu = \prod_{j=1}^{K} \mathrm{sign}(B_j \cdot S^\nu) \quad (\nu = 1, ..., \alpha N) \qquad (29)$$
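
A short sketch of the tree parity machine and of the labels produced by a teacher follows. The block layout (hidden unit $j$ sees the $j$-th consecutive block of the input), the binary weights, the sizes and the names are assumptions made for the illustration; the block size is odd so that no local field vanishes.

```python
import random

def sign(x):
    return 1 if x >= 0 else -1

def parity_output(weights, s):
    """Tree parity machine: hidden unit j sees only the j-th block of s;
    the output is the product of the hidden units."""
    k = len(weights)
    block = len(s) // k
    sigma = 1
    for j, w in enumerate(weights):
        part = s[j * block:(j + 1) * block]
        sigma *= sign(sum(wi * si for wi, si in zip(w, part)))
    return sigma

rng = random.Random(3)
K, N = 3, 15                      # N / K = 5 inputs per hidden unit
teacher = [[rng.choice((-1, 1)) for _ in range(N // K)] for _ in range(K)]
inputs = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(8)]
labels = [parity_output(teacher, s) for s in inputs]
print(labels)

# Reversing one weight vector flips every label, reversing two of them
# leaves all labels unchanged.
one_flipped = [[-x for x in teacher[0]]] + teacher[1:]
two_flipped = [[-x for x in teacher[0]], [-x for x in teacher[1]]] + teacher[2:]
```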

Figure 10: Generalization error as a function of the size of the training set for the
parity machine, which learns perfectly a set of examples given by a teacher parity
machine. From left to right: K = 1,2,3 and 4. (From Opper 1994)

The examples should be learned without errors. In this case we are interested
in the volume $V$ of all student vectors $\{W_1, ..., W_K\}$ which learn the set of
examples $\{(\sigma^\nu, S^\nu)\}$ perfectly. $V$ is an integral over an $N$-dimensional space
and corresponds to the partition sum $Z$ of the previous sections. The method
of the calculation is similar to the one used before:



1. Average $V$ over all possible sets of inputs $\{S^\nu\}$ using the replica method.

2. Calculate the integrals by introducing order parameters $R$ and $Q$, similar
to Eqs. (26) and (27).



In general, an additional average of V over all possible teacher vectors Bi is 
to be performed. 

The generalization error is shown as a function of the number of learned
examples in Fig. 10 [Opper 1994]. It reveals unexpected properties of the
network: For a large fraction of examples, $0 < \alpha < \alpha_c(K)$, the network
cannot generalize at all ($\varepsilon_g = 1/2$), although it stores of the order of $N$
examples perfectly! Zero training error does not imply an overlap between
student and teacher network, even for $\alpha > 0$.

If the number of examples is increased to a critical threshold $\alpha_c(K)$, then the
student suddenly recognizes the rule. $\varepsilon_g$ jumps to a low value and decreases
asymptotically as $1/\alpha$, independently of the number $K$ of hidden units. This
property is another surprise: The asymptotic behavior is not determined by
the Vapnik-Chervonenkis dimension, which diverges as $\ln K$ [Barkai et al
1990].



7 Structure recognition 

Up to now we have discussed supervised learning, which means that the input
patterns $S^\nu$ have the labels $\sigma^\nu$. But there are many applications of neural
networks where the labels are not given. In these cases the task is to detect
a structure in the data [Hertz et al 1991].




Figure 11: Projection of a distribution of data points. In one direction $B$ the overlap
has a double peak distribution (b), in all orthogonal directions it is Gaussian
distributed (a). (From Biehl, 1997)

Consider for example two clouds of $\alpha N$ data points as shown in Fig. 11.
The $S^\nu$ are distributed in an $N$-dimensional space according to a mixture
of two Gaussian distributions [Biehl and Mietzner 1993]. This means that
there is a direction $B$ in data space where the projections $B \cdot S^\nu / \sqrt{N}$
have a double peak distribution. In any direction $W$ orthogonal to $B$ the
corresponding projections are Gaussian distributed with a single peak, as
shown in Fig. 11. Note that the squared lengths of all vectors $S^\nu$, $W$ and $B$ are of
the order of $N$, while the overlap $S^\nu \cdot B$ is of the order of $\sqrt{N}$.

Given the $\alpha N$ data points, we want to find the direction $B$. There
exists a method, well known in engineering, which is called "principal component
analysis" and determines the direction of maximal variance in data
space [Hertz et al 1991]. In fact there is an algorithm for neural networks
which finds this direction [Oja 1982]. For our example this means that we
want to find a direction $W$ which minimizes

$$H(W) = -\sum_{\nu=1}^{\alpha N} (W \cdot S^\nu)^2 / N \qquad (30)$$

The minimum of $H$ can be found by calculating the partition sum

$$Z = \int d^N W \, \delta(W^2 - N) \exp(-\beta H(W)) \qquad (31)$$

in the limit of $\beta \to \infty$. Here we have again replaced the dynamics of the
algorithm by a summation over all possibilities. As before we have to average
$\ln Z$ over all possible data points $S^\nu$. The evaluation of the $N$-dimensional
integral in the limit $N \to \infty$ yields the order parameter

$$R = \frac{1}{N} W \cdot B \qquad (32)$$




Figure 12: Overlap $R$ between the direction $B$ of Fig. 11 and the weight vector
$W$ of the training algorithm. The theory (solid line) shows a transition from zero
to nonzero recognition with increasing number of data points in the limit $N \to \infty$.
The Monte Carlo simulations (points) for $N = 1000$ show that finite size effects
smoothen the transition. (From Biehl and Mietzner, 1993)

The result of this calculation is shown in Fig. 12 [Biehl and Mietzner 1993].
Surprisingly one observes a sharp phase transition. For a small number of
data points the system cannot recognize the direction $B$ of separation at all.
Above a critical number $\alpha_c$ the symmetry $H(W) = H(-W)$ is spontaneously
broken: $|R|$ increases with the deviation from $\alpha_c$ similar to the magnetization
in a ferromagnet.

Using the concepts of energy, partition sum and order parameter an unex- 
pected sharp transition from zero to good performance was found for the 
standard method of principal component analysis. 
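
The recognition of the direction $B$ by principal component analysis can be tried out on a small synthetic data set. The sketch below is far from the limit $N \to \infty$ and makes several assumptions of its own: each data point is unit Gaussian noise plus a shift of $\pm\rho$ along $B/\sqrt{N}$, the first principal component is found by power iteration on the second-moment matrix of the data, and the values $N = 20$, $\alpha = 20$, $\rho = 2$ are chosen well inside the region of good recognition.

```python
import math
import random

def make_clouds(n, p, rho, rng):
    """p points in n dimensions: unit Gaussian noise plus a shift of
    +rho or -rho along the direction B (with B.B = n)."""
    b = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    data = []
    for _ in range(p):
        side = rng.choice((-1.0, 1.0))
        data.append([rng.gauss(0.0, 1.0) + side * rho * bi / math.sqrt(n)
                     for bi in b])
    return b, data

def first_principal_component(data, n, rng, iterations=50):
    """Power iteration for the direction that maximizes sum_nu (W.S^nu)^2,
    normalized to W.W = n."""
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        new = [0.0] * n
        for s in data:
            proj = sum(wi * si for wi, si in zip(w, s))
            for i in range(n):
                new[i] += proj * s[i]
        norm = math.sqrt(sum(x * x for x in new))
        w = [math.sqrt(n) * x / norm for x in new]
    return w

rng = random.Random(4)
N, P, RHO = 20, 400, 2.0              # alpha = P / N = 20
B, data = make_clouds(N, P, RHO, rng)
W = first_principal_component(data, N, rng)
R = sum(wi * bi for wi, bi in zip(W, B)) / N
print("overlap |R| =", round(abs(R), 3))
```

Only $|R|$ is meaningful, since $W$ and $-W$ are equivalent for this cost function. Reducing the number of points or the separation should drive $|R|$ towards the small values expected below the transition, smeared out by finite size effects.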

8 Bayesian estimate 

In the previous section we have found the structure of a data distribution 
by minimizing a cost function. However, if one knows something about the 
structure of the data it is more efficient to include this knowledge into the 
algorithm. Here we want to discuss this problem for the two overlapping 
clouds of data considered in the previous paragraph, see Fig. 11. 

The distribution of the data points S'^ is given by 



This distribution has two parameters: The vector B of length which gives 
the direction of the cloud separation and the distance p between the centers 
of the clouds. 

Now let us assume we know the form of the distribution. Eg . (|33|) , and want 
to estimate its parameters B and p. Hence our model is for a given distance 



Given the set of data points S'^,i' = 1, aN, the a posteriori distribution 
of directions W is given by the Bayes relation 

^ aN 

pmis"}) = zU - ^) ^M-ms''; w,p)] (35) 

u=l 




oc 




(33) 




(34) 
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There are several possibilities to estimate a direction W from this distribu- 
tion [Biehl 1997]: Their performance can be measured by the order parameter 

R = j^W-B (36) 

which has a single value in the limit ^ oo. For example one may maximize 
the a posteriori distribution with respect to W. The "maximum likelihood" 
corresponds to the minimum of 



aN 




which can be studied by calculating Z for — >■ oo, using the replica method 
[Barkai and Sompolinsky 1994]. For small cloud separation p the maximum 
likelihood solution coincides with first principal component. 
Figure 13: Phase diagram for Bayes estimation of the direction $B$ of the two
clouds of data points in Fig. 11. $\alpha$ measures the number of data points and $\rho$ is
the estimated distance of the clouds relative to the true $\rho = 1$. (From Biehl 1997)

Another possibility is to select a direction $W$ according to the a posteriori
distribution, Eq. (35), for instance by using the Monte Carlo method [Watkin
and Nadal 1994]. The result is obtained from calculating $Z$ for $\beta = 1$, again
by averaging $\ln Z$ over the true distribution of data points. The result of this
Gibbs estimate is shown in Fig. 13 [Biehl 1997]. In the plane of $\alpha$, the size
of the data set, and $\rho$, the estimated distance between the clouds, there are
sharp phase transitions. Recognition ($R > 0$) appears only for a sufficiently
large number of data points. It is better to use an estimate $\rho$ which is larger
than the true one, $\rho = 1$, since for small $\rho$ the critical fraction diverges as

[equation lost in extraction]


9 On-line training

In the previous sections all of the examples were presented in the training
phase of the neural network. The algorithm used the training error with
respect to all of the examples in order to find the synaptic weights of the
student network. For instance, for the Rosenblatt algorithm all examples
have to be presented several times before the algorithm stops.

Now we want to consider a different training algorithm. At each step only
one new example is presented. One does not have to store the complete set
of the examples, but the present weight vector $W$ is changed due to one
new example $(\sigma^\nu, S^\nu)$. It turns out that such an "on-line" training is more
efficient in terms of computational effort than the "off-line" or "batch" rules
of the previous sections, if there are enough examples available.

On-line learning leads to a stochastic differential equation for the weight
vector $W(\nu)$, which becomes a deterministic one for several order parameters
in the limit $N \to \infty$ [Biehl and Schwarze 1995, Saad and Solla 1995]. Usually,
the dynamics of on-line learning is not described by a Hamiltonian or a
partition sum; nevertheless there are discontinuous properties as a function
of model parameters.

Let us consider a two layer network with continuous output neurons. The
student as well as the teacher network have three hidden units with weights
$W_i$ and $B_i$; the transfer function of the hidden units is the error function.
For simplicity the output neuron is linear with fixed weights:

$$\sigma = \sum_{i=1}^{3} \mathrm{erf}(W_i \cdot S / \sqrt{2}) \qquad (38)$$

The teacher presents $\alpha N$ examples given by

$$\sigma_B^\nu = \sum_{i=1}^{3} \mathrm{erf}(B_i \cdot S^\nu / \sqrt{2}) \qquad (39)$$

The error of a single example is defined as the quadratic deviation

$$\epsilon(W_1, W_2, W_3, S^\nu) = \frac{1}{2} (\sigma^\nu - \sigma_B^\nu)^2 \qquad (40)$$

In analogy to backpropagation [Hertz et al 1991], the change of weights is
proportional to the gradient of the training error of a single example:

$$W_i(\nu + 1) = W_i(\nu) - \frac{\eta}{N} \nabla_{W_i} \epsilon \qquad (41)$$

From this equation a system of first order, nonlinear and coupled differential
equations can be derived for the set of order parameters

$$R_{jk} = \frac{1}{N} W_j \cdot B_k, \qquad Q_{jk} = \frac{1}{N} W_j \cdot W_k \qquad (42)$$

In our case there are 15 order parameters (nine overlaps $R_{jk}$ and six
independent $Q_{jk}$) which change after each presentation of a new example.
In the limit $N \to \infty$ the index $\nu$ becomes a continuous "time" $\alpha = \nu/N$.
Hence, one has to calculate the flow of $R_{jk}(\alpha)$, $Q_{jk}(\alpha)$
in the 15 dimensional space of order parameters which determine the
generalization error $\epsilon_g(\alpha)$.
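The on-line rule of Eqs. (38)-(42) can be simulated directly for finite $N$. The following minimal sketch makes several assumptions that are not taken from the text: inputs have independent unit-variance Gaussian components, the fields are normalized as $W_i \cdot S/\sqrt{N}$ so that $Q_{jk} = W_j \cdot W_k / N$ stays of order one, the step is $\eta$ times the gradient in this normalization, and the generalization error is estimated on a fixed random test set. All function names and parameter values are hypothetical.

```python
import math
import random


def g(x):
    """Hidden-unit transfer function erf(x / sqrt(2))."""
    return math.erf(x / math.sqrt(2.0))


def g_prime(x):
    """Derivative of g."""
    return math.sqrt(2.0 / math.pi) * math.exp(-0.5 * x * x)


def fields(vectors, s):
    """Normalized fields v.s / sqrt(N) for every hidden unit (assumed scaling)."""
    n = len(s)
    return [sum(v_i * s_i for v_i, s_i in zip(v, s)) / math.sqrt(n) for v in vectors]


def output(vectors, s):
    """Linear output unit with fixed weights: sum of hidden activations."""
    return sum(g(x) for x in fields(vectors, s))


def online_step(W, B, s, eta):
    """One gradient step on the error 0.5 * (sigma - sigma_B)^2 of a single example."""
    x = fields(W, s)
    delta = sum(g(x_i) for x_i in x) - output(B, s)
    n = len(s)
    for w, x_i in zip(W, x):
        factor = eta * delta * g_prime(x_i) / math.sqrt(n)
        for i in range(n):
            w[i] -= factor * s[i]


def generalization_error(W, B, test_inputs):
    """Monte Carlo estimate of the average single-example error."""
    total = sum(0.5 * (output(W, s) - output(B, s)) ** 2 for s in test_inputs)
    return total / len(test_inputs)


def overlaps(A, C):
    """Matrix of overlaps a.c / N, e.g. R = overlaps(W, B), Q = overlaps(W, W)."""
    n = len(A[0])
    return [[sum(a * c for a, c in zip(u, v)) / n for v in C] for u in A]


def train(N=20, K=3, eta=0.5, alpha_max=100, seed=0):
    rng = random.Random(seed)
    B = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(K)]        # teacher
    W = [[0.1 * rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(K)]  # student
    test_inputs = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(500)]
    history = [generalization_error(W, B, test_inputs)]
    for step in range(1, alpha_max * N + 1):
        s = [rng.gauss(0.0, 1.0) for _ in range(N)]  # each example is used only once
        online_step(W, B, s, eta)
        if step % (10 * N) == 0:
            history.append(generalization_error(W, B, test_inputs))
    return history, overlaps(W, B), overlaps(W, W)


if __name__ == "__main__":
    history, R, Q = train()
    print(history)
```

The returned `history` samples the estimated generalization error every ten units of $\alpha$, and `R`, `Q` are the finite-$N$ counterparts of the order parameters. Whether a plateau is visible depends on $N$, $\eta$ and the initial weights, so the sketch should be read as an illustration of the update rule only.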

It turns out that there are several fixed points of this flow, which have
important consequences for the behavior of the generalization error. Fig. 14
shows a typical example. For a small number of examples, $\epsilon_g$ decreases fast.
But then the generalization error hardly changes for a long training
period. Suddenly it decreases to good performance of the network.

Figure 14: Generalization error as a function of the size of training set for a two
layer network (schematic). The plateau is related to a fixed point with one weak
repulsive direction for the flow of order parameters. (From Biehl et al 1997)

This plateau of $\epsilon_g(\alpha)$, which is observed in applications, too, can be
understood in terms of the flow of order parameters [Biehl et al 1997]. There is one
fixed point which is strongly attractive in almost all directions. But in one or
a few directions it has a small repulsive component. Hence, the flow remains
for quite a while (depending on initial conditions) close to this fixed point
with a large generalization error, but then it flows away to the completely
attractive fixed point with zero error.
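How the length of the plateau depends on the initial conditions can be made explicit in a linearized toy model. This is a sketch under assumptions that are not in the text: the flow near the fixed point is taken to be linear, with one repulsive coordinate $u$ (rate $\gamma$) and one attractive coordinate $v$ (rate $\Gamma \gg \gamma$); $u_0$, $v_0$ and $\Delta$ are hypothetical names for the initial components and for the size of the neighbourhood of the fixed point.

```latex
\begin{align}
  \frac{du}{d\alpha} &= \gamma\, u, &
  \frac{dv}{d\alpha} &= -\Gamma\, v, &
  0 < \gamma &\ll \Gamma, \\
  u(\alpha) &= u_0\, e^{\gamma \alpha}, &
  v(\alpha) &= v_0\, e^{-\Gamma \alpha}, \\
  |u(\alpha^*)| &= \Delta
  \;\Longrightarrow\;
  \alpha^* = \frac{1}{\gamma} \ln \frac{\Delta}{|u_0|}.
\end{align}
```

In this toy picture the attractive component decays quickly, while the escape time $\alpha^*$ grows as $1/\gamma$ for a weak repulsion and only logarithmically as the initial component $u_0$ along the repulsive direction becomes smaller.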

The number of fixed points depends on the learning rate $\eta$ of the training rule.
In our simple example there are at least ten different fixed points for $\eta = 1$.
With increasing $\eta$ some fixed points split into two, which usually means
that some symmetry is broken. Fixed points suddenly appear, disappear
or annihilate with varying learning rate $\eta$. Such discontinuous behavior is
reflected in the generalization error $\epsilon_g(\alpha)$.

10 Noise estimation 

The examples, given by a teacher network, may have errors. To what extent
can a student network derive information about the teacher weights from a
set of faulty examples? This problem has been investigated in detail [Copelli
et al 1997]. We consider a committee machine with a tree architecture

$$\sigma = \mathrm{sign}\left( \sum_{i=1}^{K} \mathrm{sign}(W_i \cdot S_i) \right) \qquad (43)$$


The student as well as the teacher network have the same number $K$ of
hidden units. The examples are distorted by noise: the bit $\sigma^\nu$ is flipped
with probability $\lambda$. For $\lambda = 1/2$ there is no information in the examples, but
for $0 < \lambda < 1/2$ the student network may obtain an overlap to the teacher
one with increasing number $\alpha$ of examples.
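The statement about the information content of the noisy examples can be quantified with a short worked equation. It is a sketch under assumptions that are not in the text: the teacher output $\sigma_B^\nu = \pm 1$ is taken to be equally likely for both signs, the flips are independent of the input, and $\tilde\sigma^\nu$ is a hypothetical name for the label that is actually presented.

```latex
\begin{align}
  P(\tilde\sigma^\nu = \sigma_B^\nu) &= 1 - \lambda, \\
  I(\tilde\sigma^\nu ; \sigma_B^\nu) &= 1 - H_2(\lambda), \qquad
  H_2(\lambda) = -\lambda \log_2 \lambda - (1 - \lambda) \log_2 (1 - \lambda), \\
  I &= 1 \text{ bit for } \lambda = 0, \qquad
  I = 0 \text{ for } \lambda = 1/2 .
\end{align}
```

Under these assumptions each label carries less than one bit about the teacher for any nonzero flip rate, and nothing at all at $\lambda = 1/2$.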

We study the training algorithm

$$W_k(\nu + 1) = W_k(\nu) + \frac{1}{N} F_k S_k \qquad (44)$$

Instead of a parameter $\eta$ we have introduced a function $F_k$ which is determined
from a variational principle which maximizes the decrease of the generalization
error $\epsilon_g$ at each training step [Kinouchi and Caticha 1992]. Hence,
one can define an optimal weight change, which depends on the order parameters.
It also contains the noise rate $\lambda$, which is not known in general; it
has to be estimated by a value $\Lambda$, so that $F_k(\lambda)$ is replaced by $F_k(\Lambda)$.
Figure 15: On-line training with optimal weights. $\Lambda$ is the estimated and $\lambda$ the
true noise level of the training examples. Above the solid line perfect generalization
is possible, $\epsilon_g(\alpha \to \infty) = 0$. Below the dashed line the network cannot generalize
at all. Between the dashed and solid line a partial recognition of the teacher
network is possible. (From Copelli et al 1997)
The generalization error has been calculated for optimal on-line learning
[Copelli et al 1997]; Fig. 15 shows the result. There are sharp boundaries
in the $(\lambda, \Lambda)$ plane where the behavior of the network changes drastically.
The fastest decay of $\epsilon_g(\alpha)$ is obtained if the true noise is estimated correctly,
$\Lambda = \lambda$, as expected. If the estimated noise parameter $\Lambda$ is large enough,
then $\epsilon_g(\alpha)$ still decreases to zero in the limit of an infinite number $\alpha$ of
examples. However, if $\Lambda$ is below the dashed line, then the network cannot
generalize at all. If $\Lambda$ lies in the intermediate region then the generalization
error decreases to a nonzero value; the network can generalize only partially.
Again we observe sudden changes in the behavior of the network as a function
of model parameters.



11 Time series generalization 

Most of the work on the statistical physics of neural networks has been done
on static data. A set of input vectors $\{S^\nu\}$ is taken from a static distribution
and classification labels $\{\sigma^\nu\}$ are taken from a static rule. Only recently this
research program has been extended to the analysis of time series [Eisenstein
et al 1995], which is an important field of applications of neural networks.




Figure 16: A perceptron working as sequence generator. The sequence $S_l$ of
numbers is generated by a perceptron moving to the right.

In the simplest case our elementary unit, the perceptron, is trained to a
sequence $S_l$ of real numbers, where $l$ is a discrete time. As shown in Fig. 16,
the perceptron takes a window of $N$ numbers as input and makes a prediction
of the following number,

$$S_l = \tanh\left[ \frac{\beta}{N} \sum_{j=1}^{N} W_j S_{l-j} \right] \qquad (45)$$

In the training phase the weights $W$ are changed to decrease the error of this
prediction [expression truncated in the source].

In order to apply the concepts of statistical physics one needs a well defined
sequence $\{S_l\}$. As before it may be given by a teacher perceptron with weight
vector $B$. For a given window of numbers $(S_{l-N}, \dots, S_{l-1})$ the perceptron
defines the number $S_l$. Then it moves one step and generates $S_{l+1}$. It turns
out that the generation of time series is already an interesting problem with
many unsolved puzzles [Eisenstein et al 1995, Kanter et al 1995, Schroder
and Kinzel 1997].

The numerical investigation of the sequence generator shows that an initial
state of Eq. (45) approaches a quasi-periodic attractor which is related to
a peak in the Fourier spectrum of $W$ [Eisenstein et al 1995]. Hence, the
perceptron selects one mode of the couplings. Therefore it is useful to study
couplings with a single Fourier component

$$W_j = \cos\left( 2\pi k \frac{j}{N} - \pi\phi \right) \qquad (46)$$

$k$ is the frequency and $\phi$ the phase of the weights. An attractor of Eq. (45)
is the solution of

$$S_l = \tanh\left[ \frac{\beta}{N} \sum_{j=1}^{N} \cos\left( 2\pi k \frac{j}{N} - \pi\phi \right) S_{l-j} \right] \qquad (47)$$

Recently this equation could be solved analytically [Kanter et al 1995]: For
small values of $\beta$ the attractor is zero, $S_l = 0$. For a critical value, which is
independent of the frequency $k$,

$$\beta_c = \frac{2\pi\phi}{\sin \pi\phi} \qquad (48)$$

there appears the solution

$$S_l = \tanh\left[ A(\beta) \cos\left( 2\pi (k + \phi) \frac{l}{N} \right) \right] \qquad (49)$$
The phase $\phi$ of the weights shifts the frequency $k + \phi$ of the solution. The
amplitude $A(\beta)$ is a continuous function of the distance $\beta - \beta_c > 0$ to the
critical point. For $\beta \to \infty$, tanh is replaced by sign and the sequence
generator becomes a bit generator. In this case the solution, Eq. (49), is
more complicated, but again the phase $\phi$ shifts the frequency of the bit
sequence [Schroder and Kinzel 1997].
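The appearance of a nonzero attractor above the critical point can be checked by iterating the sequence generator numerically. The following minimal sketch assumes the reading of Eqs. (45)-(48) in which the weight $W_j$ multiplies $S_{l-j}$ and the prefactor is $\beta/N$; the random initial window, the function names and the parameter values are hypothetical. It only compares the amplitude well below and well above $\beta_c$ and does not test the frequency of the solution.

```python
import math
import random


def weights(n, k, phi):
    """Couplings with a single Fourier component, W_j = cos(2 pi k j / N - pi phi)."""
    return [math.cos(2.0 * math.pi * k * j / n - math.pi * phi) for j in range(1, n + 1)]


def critical_beta(phi):
    """Critical gain beta_c = 2 pi phi / sin(pi phi)."""
    return 2.0 * math.pi * phi / math.sin(math.pi * phi)


def generate(n, k, phi, beta, steps, seed=0):
    """Iterate S_l = tanh[(beta / N) sum_j W_j S_{l-j}] from a random initial window."""
    rng = random.Random(seed)
    w = weights(n, k, phi)
    seq = [rng.uniform(-1.0, 1.0) for _ in range(n)]  # assumed initial window
    for _ in range(steps):
        # w[j - 1] is W_j and seq[-j] is S_{l-j}
        field = sum(w[j - 1] * seq[-j] for j in range(1, n + 1))
        seq.append(math.tanh(beta * field / n))
    return seq


def amplitude(seq, n):
    """Largest |S_l| within the last window of n numbers."""
    return max(abs(s) for s in seq[-n:])


if __name__ == "__main__":
    n, k, phi = 50, 5, 0.3
    print("beta_c =", critical_beta(phi))
    for beta in (1.0, 4.0):
        print(beta, amplitude(generate(n, k, phi, beta, steps=3000), n))
```

For the gain below $\beta_c$ the iteration contracts to $S_l = 0$, while for the gain above $\beta_c$ the sequence settles on an oscillation of finite amplitude. Values of $\beta$ close to $\beta_c$ need much longer runs and show finite-$N$ corrections, which this sketch does not address.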
Figure 17: Return map of a sequence generated by a multilayer network with
two hidden units. The one-dimensional attractor in the middle becomes a
two-dimensional one, if the parameter $\beta$ is increased. (From Kanter et al 1995)



If the network is a multilayer perceptron with $K$ hidden units, then the
attractor of the sequence generator is a kind of superposition of the single
modes of each hidden unit [Kanter et al 1995]. Each unit has its own critical
point and the number of nonzero solutions determines the dimension of the
attractor. This is shown in the return map of Fig. 17 where $S_{l+1}$ is plotted
against $S_l$ for $K = 2$ hidden units. With increasing $\beta$, first one component is
activated, leading to a one-dimensional attractor shown in the middle of
the figure. Since in general the frequency is irrational, the attractor is a
continuous curve. For larger values of $\beta$ the second unit is activated, giving
the two-dimensional attractor with larger amplitude.
12 Summary



Concepts of statistical physics have successfully been applied to the theory 
of neural computation. The cooperative behavior of a large number of in- 
teracting neurons can be described in terms of partition sums and order 
parameters. The competition between training error and entropy may lead 
to discontinuous properties. 

The approach of statistical mechanics has several advantages: 

1. Networks with an infinite number of neurons and synapses can be calculated
analytically. Complex cooperative behavior of interacting neurons
is described in terms of a few order parameters.

2. The results are obtained for a typical situation, for instance for the 
most general set of examples. 

3. One obtains exact mathematical relations between the observed coop- 
erative properties of the network, its model parameters and the size of 
the training set. 

4. Many networks and algorithms show discontinuous properties as a func- 
tion of model parameters or the number of presented examples. Sta- 
tistical physics can describe such sudden changes in the cooperative 
behavior of the network. 

Statistical mechanics of neural networks has been applied to a variety of prob- 
lems; we just want to mention learning from examples, generalization, asso- 
ciative memory, attractor networks, structure recognition, clustering, classi- 
fication, coding and time series analysis. For all of these problems general 
properties have been calculated mathematically. Novel and unexpected
results have been found. Hence, I think that in the last 15 years theoretical
physics has successfully contributed to our understanding of neural networks,
with impact on neurobiology and computer science.